{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Pandas 数据结构 - Series\n",
    "\n",
    "Pandas Series 类似表格中的一个列（column），类似于一维数组，可以保存任何数据类型。\n",
    "创建Series的方式：`pandas.Series( data, index, dtype, name, copy)`\n",
    "- data: 数据\n",
    "- index: 索引，也可称为行标签\n",
    "- dtype: 数据类型，会自动判断\n",
    "- name：Series的名称\n",
    "- copy: 是否复制，默认False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1\n",
       "1    2\n",
       "2    3\n",
       "dtype: int64"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "a = [1, 2, 3]\n",
    "s1 = pd.Series(a)\n",
    "s1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "name                    Tom\n",
       "age                      20\n",
       "favorite_sports    football\n",
       "Name: Tom Infomation, dtype: object"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a=[\"Tom\",20,\"football\"]\n",
    "s1=pd.Series(a,index=[\"name\",\"age\",\"favorite_sports\"],name=\"Tom Infomation\")\n",
    "s1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Pandas 数据结构 - DataFrame\n",
    "\n",
    "DataFrame 是一个表格型的数据结构，它含有一组有序的列，每列可以是不同的值类型（数值、字符串、布尔型值）。DataFrame 既有行索引也有列索引，它可以被看做由 Series 组成的字典(共同用一个索引)。\n",
    "创建方式：`pandas.DataFrame( data, index, columns, dtype, copy)`\n",
    "参数说明：\n",
    "\n",
    "- data：一组数据(ndarray、series, map, lists, dict 等类型)。\n",
    "- index：索引值，或者可以称为行标签。\n",
    "- columns：列标签，默认为 RangeIndex (0, 1, 2, …, n) 。\n",
    "- dtype：数据类型。\n",
    "- copy：拷贝数据，默认为 False。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Age</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>No.1</th>\n",
       "      <td>Tom</td>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>No.2</th>\n",
       "      <td>Alex</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>No.3</th>\n",
       "      <td>Nancy</td>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       Name  Age\n",
       "No.1    Tom   15\n",
       "No.2   Alex   12\n",
       "No.3  Nancy   13"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "data = [['Tom',15],['Alex',12],['Nancy',13]]\n",
    "df = pd.DataFrame(data,index=[\"No.1\",\"No.2\",\"No.3\"],columns=['Name','Age'])\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Tom'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 通过loc属性，可以访问行数据\n",
    "df.loc[\"No.1\",'Name']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Age</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>No.1</th>\n",
       "      <td>Tom</td>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>No.2</th>\n",
       "      <td>Alex</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      Name  Age\n",
       "No.1   Tom   15\n",
       "No.2  Alex   12"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.loc[\"No.1\":\"No.2\",[\"Name\",\"Age\"]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "No.1      Tom\n",
       "No.2     Alex\n",
       "No.3    Nancy\n",
       "Name: Name, dtype: object"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 通过列标签，访问Column\n",
    "df[\"Name\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Pandas 数据清洗\n",
    "数据清洗是对一些没有用的数据进行处理的过程。\n",
    "很多数据集存在数据缺失、数据格式错误、错误数据或重复数据的情况，如果要对使数据分析更加准确，就需要对这些没有用的数据进行处理。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "           PID  ST_NUM    ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT\n",
      "0  100001000.0   104.0     PUTNAM            Y            3        1  1000\n",
      "1  100002000.0   197.0  LEXINGTON            N            3      1.5    --\n",
      "2  100003000.0     NaN  LEXINGTON            N          NaN        1   850\n",
      "3  100004000.0   201.0   BERKELEY           12            1      NaN   700\n",
      "4          NaN   203.0   BERKELEY            Y            3        2  1600\n",
      "----------------------------------------------------------------------------------\n",
      "           PID  ST_NUM     ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT\n",
      "0  100001000.0   104.0      PUTNAM            Y            3        1  1000\n",
      "1  100002000.0   197.0   LEXINGTON            N            3      1.5    --\n",
      "2  100003000.0     NaN   LEXINGTON            N          NaN        1   850\n",
      "3  100004000.0   201.0    BERKELEY           12            1      NaN   700\n",
      "4          NaN   203.0    BERKELEY            Y            3        2  1600\n",
      "5  100006000.0   207.0    BERKELEY            Y          NaN        1   800\n",
      "6  100007000.0     NaN  WASHINGTON          NaN            2   HURLEY   950\n",
      "7  100008000.0   213.0     TREMONT            Y            1        1   NaN\n",
      "8  100009000.0   215.0     TREMONT            Y           na        2  1800\n",
      "9  100009000.0   215.0     TREMONT            Y           na        2  1800\n"
     ]
    }
   ],
   "source": [
    "# 读取csv文件\n",
    "df = pd.read_csv('property-data.csv')\n",
    "#打印前n行(默认5行)\n",
    "print(df.head())\n",
    "print(\"----------------------------------------------------------------------------------\")\n",
    "#打印全部数据\n",
    "print(df) \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     PID  ST_NUM  ST_NAME  OWN_OCCUPIED  NUM_BEDROOMS  NUM_BATH  SQ_FT\n",
      "0  False   False    False         False         False     False  False\n",
      "1  False   False    False         False         False     False  False\n",
      "2  False    True    False         False          True     False  False\n",
      "3  False   False    False         False         False      True  False\n",
      "4   True   False    False         False         False     False  False\n",
      "5  False   False    False         False          True     False  False\n",
      "6  False    True    False          True         False     False  False\n",
      "7  False   False    False         False         False     False   True\n",
      "8  False   False    False         False         False     False  False\n",
      "9  False   False    False         False         False     False  False\n"
     ]
    }
   ],
   "source": [
    "# 查看是空的数据\n",
    "print(df.isna())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PID</th>\n",
       "      <th>ST_NUM</th>\n",
       "      <th>ST_NAME</th>\n",
       "      <th>OWN_OCCUPIED</th>\n",
       "      <th>NUM_BEDROOMS</th>\n",
       "      <th>NUM_BATH</th>\n",
       "      <th>SQ_FT</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>100001000.0</td>\n",
       "      <td>104.0</td>\n",
       "      <td>PUTNAM</td>\n",
       "      <td>Y</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>100002000.0</td>\n",
       "      <td>197.0</td>\n",
       "      <td>LEXINGTON</td>\n",
       "      <td>N</td>\n",
       "      <td>3</td>\n",
       "      <td>1.5</td>\n",
       "      <td>--</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>100009000.0</td>\n",
       "      <td>215.0</td>\n",
       "      <td>TREMONT</td>\n",
       "      <td>Y</td>\n",
       "      <td>na</td>\n",
       "      <td>2</td>\n",
       "      <td>1800</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>100009000.0</td>\n",
       "      <td>215.0</td>\n",
       "      <td>TREMONT</td>\n",
       "      <td>Y</td>\n",
       "      <td>na</td>\n",
       "      <td>2</td>\n",
       "      <td>1800</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           PID  ST_NUM    ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT\n",
       "0  100001000.0   104.0     PUTNAM            Y            3        1  1000\n",
       "1  100002000.0   197.0  LEXINGTON            N            3      1.5    --\n",
       "8  100009000.0   215.0    TREMONT            Y           na        2  1800\n",
       "9  100009000.0   215.0    TREMONT            Y           na        2  1800"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 删除所有包含na数据行\n",
    "df.dropna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PID</th>\n",
       "      <th>ST_NUM</th>\n",
       "      <th>ST_NAME</th>\n",
       "      <th>OWN_OCCUPIED</th>\n",
       "      <th>NUM_BEDROOMS</th>\n",
       "      <th>NUM_BATH</th>\n",
       "      <th>SQ_FT</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>100001000.0</td>\n",
       "      <td>104.0</td>\n",
       "      <td>PUTNAM</td>\n",
       "      <td>Y</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>100002000.0</td>\n",
       "      <td>197.0</td>\n",
       "      <td>LEXINGTON</td>\n",
       "      <td>N</td>\n",
       "      <td>3</td>\n",
       "      <td>1.5</td>\n",
       "      <td>--</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>100003000.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>LEXINGTON</td>\n",
       "      <td>N</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>850</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>100004000.0</td>\n",
       "      <td>201.0</td>\n",
       "      <td>BERKELEY</td>\n",
       "      <td>12</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>700</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>NaN</td>\n",
       "      <td>203.0</td>\n",
       "      <td>BERKELEY</td>\n",
       "      <td>Y</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>1600</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>100006000.0</td>\n",
       "      <td>207.0</td>\n",
       "      <td>BERKELEY</td>\n",
       "      <td>Y</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>800</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>100007000.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>WASHINGTON</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "      <td>HURLEY</td>\n",
       "      <td>950</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>100009000.0</td>\n",
       "      <td>215.0</td>\n",
       "      <td>TREMONT</td>\n",
       "      <td>Y</td>\n",
       "      <td>na</td>\n",
       "      <td>2</td>\n",
       "      <td>1800</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>100009000.0</td>\n",
       "      <td>215.0</td>\n",
       "      <td>TREMONT</td>\n",
       "      <td>Y</td>\n",
       "      <td>na</td>\n",
       "      <td>2</td>\n",
       "      <td>1800</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           PID  ST_NUM     ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT\n",
       "0  100001000.0   104.0      PUTNAM            Y            3        1  1000\n",
       "1  100002000.0   197.0   LEXINGTON            N            3      1.5    --\n",
       "2  100003000.0     NaN   LEXINGTON            N          NaN        1   850\n",
       "3  100004000.0   201.0    BERKELEY           12            1      NaN   700\n",
       "4          NaN   203.0    BERKELEY            Y            3        2  1600\n",
       "5  100006000.0   207.0    BERKELEY            Y          NaN        1   800\n",
       "6  100007000.0     NaN  WASHINGTON          NaN            2   HURLEY   950\n",
       "8  100009000.0   215.0     TREMONT            Y           na        2  1800\n",
       "9  100009000.0   215.0     TREMONT            Y           na        2  1800"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 按条件删除Na数据\n",
    "df.dropna(axis=0,how='any',subset='SQ_FT')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PID</th>\n",
       "      <th>ST_NUM</th>\n",
       "      <th>ST_NAME</th>\n",
       "      <th>OWN_OCCUPIED</th>\n",
       "      <th>NUM_BEDROOMS</th>\n",
       "      <th>NUM_BATH</th>\n",
       "      <th>SQ_FT</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>100001000.0</td>\n",
       "      <td>104.0</td>\n",
       "      <td>PUTNAM</td>\n",
       "      <td>Y</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1</td>\n",
       "      <td>1000.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>100002000.0</td>\n",
       "      <td>197.0</td>\n",
       "      <td>LEXINGTON</td>\n",
       "      <td>N</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.5</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>100003000.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>LEXINGTON</td>\n",
       "      <td>N</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>850.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>100004000.0</td>\n",
       "      <td>201.0</td>\n",
       "      <td>BERKELEY</td>\n",
       "      <td>12</td>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>700.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>NaN</td>\n",
       "      <td>203.0</td>\n",
       "      <td>BERKELEY</td>\n",
       "      <td>Y</td>\n",
       "      <td>3.0</td>\n",
       "      <td>2</td>\n",
       "      <td>1600.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>100006000.0</td>\n",
       "      <td>207.0</td>\n",
       "      <td>BERKELEY</td>\n",
       "      <td>Y</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>800.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>100007000.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>WASHINGTON</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.0</td>\n",
       "      <td>HURLEY</td>\n",
       "      <td>950.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>100008000.0</td>\n",
       "      <td>213.0</td>\n",
       "      <td>TREMONT</td>\n",
       "      <td>Y</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>100009000.0</td>\n",
       "      <td>215.0</td>\n",
       "      <td>TREMONT</td>\n",
       "      <td>Y</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "      <td>1800.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>100009000.0</td>\n",
       "      <td>215.0</td>\n",
       "      <td>TREMONT</td>\n",
       "      <td>Y</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "      <td>1800.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           PID  ST_NUM     ST_NAME OWN_OCCUPIED  NUM_BEDROOMS NUM_BATH   SQ_FT\n",
       "0  100001000.0   104.0      PUTNAM            Y           3.0        1  1000.0\n",
       "1  100002000.0   197.0   LEXINGTON            N           3.0      1.5     NaN\n",
       "2  100003000.0     NaN   LEXINGTON            N           NaN        1   850.0\n",
       "3  100004000.0   201.0    BERKELEY           12           1.0      NaN   700.0\n",
       "4          NaN   203.0    BERKELEY            Y           3.0        2  1600.0\n",
       "5  100006000.0   207.0    BERKELEY            Y           NaN        1   800.0\n",
       "6  100007000.0     NaN  WASHINGTON          NaN           2.0   HURLEY   950.0\n",
       "7  100008000.0   213.0     TREMONT            Y           1.0        1     NaN\n",
       "8  100009000.0   215.0     TREMONT            Y           NaN        2  1800.0\n",
       "9  100009000.0   215.0     TREMONT            Y           NaN        2  1800.0"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 读取csv文件,将'--','na','n/a'都视为空值\n",
    "df = pd.read_csv('property-data.csv',na_values=['--','na','n/a'])\n",
    "#打印全部数据\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    False\n",
      "1     True\n",
      "2    False\n",
      "3    False\n",
      "4    False\n",
      "5    False\n",
      "6    False\n",
      "7     True\n",
      "8    False\n",
      "9    False\n",
      "Name: SQ_FT, dtype: bool\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0    1000.0\n",
       "1    1000.0\n",
       "2     850.0\n",
       "3     700.0\n",
       "4    1600.0\n",
       "5     800.0\n",
       "6     950.0\n",
       "7    1000.0\n",
       "8    1800.0\n",
       "9    1800.0\n",
       "Name: SQ_FT, dtype: float64"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 查看SQ_FT列中为空的数据\n",
    "print(df[\"SQ_FT\"].isna())\n",
    "# 填充na数据\n",
    "df[\"SQ_FT\"].fillna(1000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "替换空单元格的常用方法是计算列的均值、中位数值或众数。\n",
    "Pandas使用 mean()、median() 和 mode() 方法计算列的均值（所有值加起来的平均值）、中位数值（排序后排在中间的数）和众数（出现频率最高的数）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "平均值： 194.375\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PID</th>\n",
       "      <th>ST_NUM</th>\n",
       "      <th>ST_NAME</th>\n",
       "      <th>OWN_OCCUPIED</th>\n",
       "      <th>NUM_BEDROOMS</th>\n",
       "      <th>NUM_BATH</th>\n",
       "      <th>SQ_FT</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>100001000.0</td>\n",
       "      <td>104.000</td>\n",
       "      <td>PUTNAM</td>\n",
       "      <td>Y</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1</td>\n",
       "      <td>1000.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>100002000.0</td>\n",
       "      <td>197.000</td>\n",
       "      <td>LEXINGTON</td>\n",
       "      <td>N</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.5</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>100003000.0</td>\n",
       "      <td>194.375</td>\n",
       "      <td>LEXINGTON</td>\n",
       "      <td>N</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>850.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>100004000.0</td>\n",
       "      <td>201.000</td>\n",
       "      <td>BERKELEY</td>\n",
       "      <td>12</td>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>700.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>NaN</td>\n",
       "      <td>203.000</td>\n",
       "      <td>BERKELEY</td>\n",
       "      <td>Y</td>\n",
       "      <td>3.0</td>\n",
       "      <td>2</td>\n",
       "      <td>1600.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>100006000.0</td>\n",
       "      <td>207.000</td>\n",
       "      <td>BERKELEY</td>\n",
       "      <td>Y</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>800.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>100007000.0</td>\n",
       "      <td>194.375</td>\n",
       "      <td>WASHINGTON</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.0</td>\n",
       "      <td>HURLEY</td>\n",
       "      <td>950.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>100008000.0</td>\n",
       "      <td>213.000</td>\n",
       "      <td>TREMONT</td>\n",
       "      <td>Y</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>100009000.0</td>\n",
       "      <td>215.000</td>\n",
       "      <td>TREMONT</td>\n",
       "      <td>Y</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "      <td>1800.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>100009000.0</td>\n",
       "      <td>215.000</td>\n",
       "      <td>TREMONT</td>\n",
       "      <td>Y</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "      <td>1800.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           PID   ST_NUM     ST_NAME OWN_OCCUPIED  NUM_BEDROOMS NUM_BATH  \\\n",
       "0  100001000.0  104.000      PUTNAM            Y           3.0        1   \n",
       "1  100002000.0  197.000   LEXINGTON            N           3.0      1.5   \n",
       "2  100003000.0  194.375   LEXINGTON            N           NaN        1   \n",
       "3  100004000.0  201.000    BERKELEY           12           1.0      NaN   \n",
       "4          NaN  203.000    BERKELEY            Y           3.0        2   \n",
       "5  100006000.0  207.000    BERKELEY            Y           NaN        1   \n",
       "6  100007000.0  194.375  WASHINGTON          NaN           2.0   HURLEY   \n",
       "7  100008000.0  213.000     TREMONT            Y           1.0        1   \n",
       "8  100009000.0  215.000     TREMONT            Y           NaN        2   \n",
       "9  100009000.0  215.000     TREMONT            Y           NaN        2   \n",
       "\n",
       "    SQ_FT  \n",
       "0  1000.0  \n",
       "1     NaN  \n",
       "2   850.0  \n",
       "3   700.0  \n",
       "4  1600.0  \n",
       "5   800.0  \n",
       "6   950.0  \n",
       "7     NaN  \n",
       "8  1800.0  \n",
       "9  1800.0  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mean=df[\"ST_NUM\"].mean()\n",
    "print(\"平均值：\",mean)\n",
    "df[\"ST_NUM\"].fillna(mean,inplace=True)\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    False\n",
       "1    False\n",
       "2    False\n",
       "3    False\n",
       "4    False\n",
       "5    False\n",
       "6    False\n",
       "7    False\n",
       "8    False\n",
       "9     True\n",
       "dtype: bool"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 查看重复数据\n",
    "df.duplicated()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PID</th>\n",
       "      <th>ST_NUM</th>\n",
       "      <th>ST_NAME</th>\n",
       "      <th>OWN_OCCUPIED</th>\n",
       "      <th>NUM_BEDROOMS</th>\n",
       "      <th>NUM_BATH</th>\n",
       "      <th>SQ_FT</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>100001000.0</td>\n",
       "      <td>104.000</td>\n",
       "      <td>PUTNAM</td>\n",
       "      <td>Y</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1</td>\n",
       "      <td>1000.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>100002000.0</td>\n",
       "      <td>197.000</td>\n",
       "      <td>LEXINGTON</td>\n",
       "      <td>N</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.5</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>100003000.0</td>\n",
       "      <td>194.375</td>\n",
       "      <td>LEXINGTON</td>\n",
       "      <td>N</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>850.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>100004000.0</td>\n",
       "      <td>201.000</td>\n",
       "      <td>BERKELEY</td>\n",
       "      <td>12</td>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>700.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>NaN</td>\n",
       "      <td>203.000</td>\n",
       "      <td>BERKELEY</td>\n",
       "      <td>Y</td>\n",
       "      <td>3.0</td>\n",
       "      <td>2</td>\n",
       "      <td>1600.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>100006000.0</td>\n",
       "      <td>207.000</td>\n",
       "      <td>BERKELEY</td>\n",
       "      <td>Y</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>800.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>100007000.0</td>\n",
       "      <td>194.375</td>\n",
       "      <td>WASHINGTON</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.0</td>\n",
       "      <td>HURLEY</td>\n",
       "      <td>950.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>100008000.0</td>\n",
       "      <td>213.000</td>\n",
       "      <td>TREMONT</td>\n",
       "      <td>Y</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>100009000.0</td>\n",
       "      <td>215.000</td>\n",
       "      <td>TREMONT</td>\n",
       "      <td>Y</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "      <td>1800.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           PID   ST_NUM     ST_NAME OWN_OCCUPIED  NUM_BEDROOMS NUM_BATH  \\\n",
       "0  100001000.0  104.000      PUTNAM            Y           3.0        1   \n",
       "1  100002000.0  197.000   LEXINGTON            N           3.0      1.5   \n",
       "2  100003000.0  194.375   LEXINGTON            N           NaN        1   \n",
       "3  100004000.0  201.000    BERKELEY           12           1.0      NaN   \n",
       "4          NaN  203.000    BERKELEY            Y           3.0        2   \n",
       "5  100006000.0  207.000    BERKELEY            Y           NaN        1   \n",
       "6  100007000.0  194.375  WASHINGTON          NaN           2.0   HURLEY   \n",
       "7  100008000.0  213.000     TREMONT            Y           1.0        1   \n",
       "8  100009000.0  215.000     TREMONT            Y           NaN        2   \n",
       "\n",
       "    SQ_FT  \n",
       "0  1000.0  \n",
       "1     NaN  \n",
       "2   850.0  \n",
       "3   700.0  \n",
       "4  1600.0  \n",
       "5   800.0  \n",
       "6   950.0  \n",
       "7     NaN  \n",
       "8  1800.0  "
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 删除重复数据\n",
    "df.drop_duplicates()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.10.2 ('venv': venv)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.2"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "1577a05ddfca352f4d657d7bbde71d44e58f3858e3f81e48a10db2a97469c5a1"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
